Statistical analysis of co-occurrence patterns in microbial presence-absence datasets
Abstract
Drawing on a long history in macroecology, correlation analysis of microbiome datasets is becoming a common practice for identifying relationships or shared ecological niches among bacterial taxa. However, many of the statistical issues that plague such analyses in macroscale communities remain unresolved for microbial communities. Here, we discuss problems in the analysis of microbial species correlations based on presence-absence data. We focus on presence-absence data because this information is more readily obtainable from sequencing studies, especially for whole-genome sequencing, where abundance estimation is still in its infancy. First, we show how Pearson's correlation coefficient (r) and Jaccard's index (J), two of the most common metrics for correlation analysis of presence-absence data, can contradict each other when applied to a typical microbiome dataset. In our dataset, for example, 14% of species-pairs predicted to be significantly correlated by r were not predicted to be significantly correlated using J, while 37.4% of species-pairs predicted to be significantly correlated by J were not predicted to be significantly correlated using r. Mismatch was particularly common among species-pairs with at least one rare species (<10% prevalence), explaining why r and J might differ more strongly in microbiome datasets, where there are large numbers of rare taxa. Indeed, 74% of all species-pairs in our study had at least one rare species. Next, we show how Pearson's correlation coefficient can result in artificial inflation of positive taxon relationships and how this is a particular problem for microbiome studies. We then illustrate how Jaccard's index of similarity (J) can yield improvements over Pearson's correlation coefficient. However, the standard null model for Jaccard's index is flawed, and thus introduces its own set of spurious conclusions.
We thus identify a better null model based on a hypergeometric distribution, which appropriately corrects for species prevalence. This model is available from recent statistics literature, and can be used for evaluating the significance of any value of an empirically observed Jaccard's index. The resulting simple yet effective method for handling correlation analysis of microbial presence-absence datasets provides a robust means of testing and finding relationships and/or shared environmental responses among microbial taxa.
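The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the one-sided upper-tail p-value is an assumption about how the hypergeometric null would be applied. With N samples and species prevalences a and b held fixed, the number of shared samples k follows a hypergeometric distribution under independence, and J = k / (a + b - k) increases with k, so the upper tail of k gives a prevalence-corrected p-value for an observed J.

```python
from math import comb, sqrt

def cooccurrence_counts(x, y):
    """Return (n, a, b, k) for two equal-length 0/1 presence-absence vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    n = len(x)                                        # number of samples
    a = sum(x)                                        # samples containing species A
    b = sum(y)                                        # samples containing species B
    k = sum(1 for xi, yi in zip(x, y) if xi and yi)   # samples containing both
    return n, a, b, k

def jaccard(a, b, k):
    """Jaccard's index: shared samples over samples containing either species."""
    union = a + b - k
    return k / union if union else float("nan")

def pearson_binary(n, a, b, k):
    """Pearson's r for two binary vectors (equivalent to the phi coefficient)."""
    denom = sqrt(a * (n - a) * b * (n - b))
    return (n * k - a * b) / denom if denom else float("nan")

def hypergeom_pmf(n, a, b, k):
    """P(K = k) when b samples are drawn at random from n, of which a hold species A."""
    return comb(a, k) * comb(n - a, b - k) / comb(n, b)

def jaccard_upper_p(n, a, b, k):
    """Assumed one-sided test: probability of at least k shared samples under the null."""
    return sum(hypergeom_pmf(n, a, b, i) for i in range(k, min(a, b) + 1))

# Illustrative (made-up) data: two rare species across 20 samples, sharing one sample.
species_a = [1, 1] + [0] * 18
species_b = [0, 1, 1] + [0] * 17
n, a, b, k = cooccurrence_counts(species_a, species_b)
print("J =", jaccard(a, b, k))
print("r =", pearson_binary(n, a, b, k))
print("p =", jaccard_upper_p(n, a, b, k))
```

Because the null distribution is conditioned on a and b, two rare species that share a single sample are judged against how often such overlap arises by chance at those prevalences, rather than against a prevalence-blind threshold.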
Similar resources
Demonstrating microbial co-occurrence pattern analyses within and between ecosystems
Co-occurrence patterns are used in ecology to explore interactions between organisms and environmental effects on coexistence within biological communities. Analysis of co-occurrence patterns among microbial communities has ranged from simple pairwise comparisons between all community members to direct hypothesis testing between focal species. However, co-occurrence patterns are rarely studied ...
A probability-based analysis of temporal and spatial co-occurrence in grassland birds
To date, most studies of species co-occurrence have involved the analysis of static species presence–absence matrices. These analyses are often contentious: researchers sometimes disagree about how to construct the randomized matrices that are required for testing the statistical significance of observed co-occurrence patterns (Gotelli, 2000). There are also a variety ...
Investigating species co-occurrence patterns when species are detected imperfectly
1. Over the last 30 years there has been a great deal of interest in investigating patterns of species co-occurrence across a number of locations, which has led to the development of numerous methods to determine whether there is evidence that a particular pattern may not have occurred by random chance. 2. A key aspect that seems to have been largely overlooked is the possibility that species m...
Pathogens in water: value and limits of correlation with microbial indicators.
This article discusses the value and limitations of using microbial indicators to predict occurrence of enteric pathogens in water. Raw or treated sewage is a primary source of fecal contamination of the receiving surface water or groundwater; hence, understanding the relationship between pathogens and indicators in sewage is an important step in understanding the correlation in receiving water...
Deciphering microbial interactions and detecting keystone species with co-occurrence networks
Co-occurrence networks produced from microbial survey sequencing data are frequently used to identify interactions between community members. While this approach has potential to reveal ecological processes, it has been insufficiently validated due to the technical limitations inherent in studying complex microbial ecosystems. Here, we simulate multi-species microbial communities with known int...